Transcriptomic and machine learning based molecular landscape provides a framework for better understanding estrogen receptor positive breast cancer

Author

Carlos Ronchi, Cathrin Brisken

Abstract

Estrogen receptor (ER) is a key driver of ER positive breast cancer (BC) and targeted by endocrine therapies. Among ER+ BC patients, one half of all recurrences happen after five years of endocrine therapy. In this paper we show that among ER+ BC patients that received only endocrine therapy, molecular estrogen signaling signatures are predictive on TCGA, METABRIC and SCANB (over 10000 patients in total). Even among tumors with ER IHC percentage greater than 90%, estrogen signatures are predictive, showing a discrepancy between ER IHC and functional ER signaling. In order to better understand the molecular biology of these tumors, we developed a new pipeline using unsupervised machine learning algorithms to integrate and interpret publicly available datasets in a single sample manner, such that patient samples can be used regardless of the cohort of origin and sequencing platform. To validate the pipeline, we used SCANB (RNAseq), normal mammary gland samples (RNAseq), patient derived xenografts (PDX, RNASeq) and data from the POETIC trial (microarray, ER+ only patients). We show that the molecular landscape, where samples are integrated and embedded together, captures the different PAM50 molecular subtypes and does not depend on clinical variables, such as age, tumor stage and tumor purity. Moreover, all patients from the POETIC trial whose embeddings in the molecular landscape are on the basal region were non-responders. We further show using the POETIC data that looking at the neighborhoods in the molecular landscape of a responder and non-responder we see differences in estrogen signaling, highlighting potential differences in the treatment. PI3K AKT MTOR signaling and G2M checkpoint are two pathways that are drivers of the molecular landscape and we showed that they are also predictive for RFS on SCANB and METABRIC, highlighting possible treatments with drugs currently in the market. Finally, we derive a new prognostic risk score based on sample position in the molecular landscape and on clinical factors that outperforms commercially available signatures and provides additional information when used in combination with the risk of recurrence (ROR). Thus, we show how the molecular landscape based on transcriptomics data could extend and help in the clinical practice.

Introduction

Breast cancer is the most diagnosed cancer worldwide (1). It is a very heterogeneous disease that can be subdivided in different histological and molecular subtypes. Patient tumors are classified based on the expression as detected by IHC of the estrogen, progesterone receptors and HER2 membrane protein. The estrogen receptor positive (ER+) breast cancer subtype represents more than 70% of all cases (2). Tumor cells of this subtype contain the estrogen receptor whose main mechanism is the target of selective estrogen receptor modulator (SERM), selective estrogen receptor degrader (SERD) and aromatase inhibitors (AI). To receive endocrine therapy, tumors need to be ER+, which according to the ASCO guidelines are defined as having at least 1% of the tumor cells expressing estrogen receptor (3), whereas the Swedish national guidelines uses a threshold of 10% for ER status (4). It has been shown that the patients in the low spectrum of ER positivity, from 1% to 10%, do not benefit as much as patients with 10% to 100% of ER positive cells in the tumor (5).

Transcriptomics and genomics have revolutionized cancer research. The new tools opened the option to better understand the molecular underpinnings of cancer biology. In breast cancer, molecular subtyping through the use of next generation sequencing technologies and qRT-PCR have been used to aid clinicians when taking the decision of giving chemotherapy for patients, avoiding potential overtreatment of patients. Researchers have developed gene signatures that are able to assign a risk score. Such score is based on survival data found in the literature and the higher the risk score of a patient, the more likely the recurrence, therefore the use of additional chemotherapy in first-line therapy is advised. Some of the risk scores are already recommended in international guidelines (10) which are already being used in the clinics (610). They are all based in small sets of genes that were obtained after several statistical procedures, usually using forward variable selection on cox regression. However, these signatures do not provide a complete explanation on why a patient should receive additional therapy, they are mostly prognostic. What is known is that they are composed of several submodules pathways, such as estrogen, proliferation and HER2 signaling and that some signatures are correlated to some of its submodules. For example, the recurrence score (RS) is associated with the estrogen module and the Risk of Recurrence (ROR) signature is correlated with its proliferation module (11). An alternative to the signatures only approach is to use RNA-sequencing in the clinics (12). The SCANB consortium showed that it is possible to use mutational and gene expression based biomarkers within one week of tumor surgery (1315) in the clinics. They also showed that it is possible to use RNA-seq to calculate the Prosigna risk scores and risk stratification (16), showing the potential all in one use of RNA-seq instead of using different assays from different companies.

Even though there are several new analytical techniques to understand the risk of recurrence for ER+ BC patients, there is still a lack of understanding of the patient’s tumor molecular biology. There are challenges to overcome in the pathway analysis for patients individually and how to compare patients molecularly when considering complete transcriptomics data. There is no tool that allows to integrate patients in a single sample way and that does not require large amounts of data. In most cases integration is a one step procedure and cannot be updated. The common tools used in the RNA-seq community for batch effect removal are (1719) and new samples cannot be integrated in a straightforward manner. The only way is to re-run the procedure together with the new sample. Such a procedure is not usually feasible as there are not enough samples to estimate batch effects across groups and therefore correct them.

We propose here a new pipeline for integrating publicly available datasets and developing a framework for personalized medicine and the understanding of molecular pathways at the patient level. By developing a normalization and embedding technique, we show that it is possible to integrate publicly available molecular datasets, such as TCGA, SCANB, METABRIC, microarray data of some patients from the POETIC trial, patient derived xenografts and healthy breast tissue (12,2023) into a common space, called molecular landscape. By using a special normalization and principal component analysis we can combine the datasets together explore the full molecular landscape of RNA-sequencing and microarray samples together. The new molecular landscape allows us to add new one single sample independently, showing its molecular subtype without the need of any other classification tool that has been validated in another dataset or centering techniques. It is possible to compare patients in a small neighborhood of the molecular landscape, providing a context to better understand the biology of the individual tumor. Moreover, we propose the use of different gene sets from the hallmarks collection of the Molecular Signature Database (MSigDB) (24,25), enabling understanding of individual patient tumor biology.

Alternative treatments are necessary for early stage ER+ BC patients. Currently chemotherapy and immunotherapy ared used as a second line therapy for these patients. Here we show that G2M checkpoint and PI3K AKT MTOR signaling pathways are important drivers of the molecular landscape and are predictive in METABRIC and SCANB among patients that received only endocrine therapy. Drugs are available for these pathways, but they are only approved for metastatic BC. Moreover, Novartis release a statement revealing the results of an interim analysis of the phase III NATALEE trial, showing the benefit of a CDK4/6 inhibitor (ribociclib) in combination with ET compared to ET only for early stage ER+ BC patients (26). Our data here further supports the results of this interim analysis.

To conclude the framework we also developed a risk score signature that depends not on a specific subset of genes but on the position of the molecular landscape and clinical factors. We show that when used in combination with the ROR score it is still statistically significant and provides additional information.

A R package is provided on github (github.com/chronchi/molecular_landscape) with instructions on how to use the newly created molecular landscape and how to integrate new samples and calculate the risk scores.

Methods

Cohorts

Publicly available breast cancer cohorts TCGA, SCANB, METABRIC and POETIC (12,2022) were used to calculate the principal component analysis (PCA) embeddings and validation. From TCGA, SCANB and METABRIC, only primary breast cancer samples were included. Each cohort has over 1000, 1000 and 8000 primary samples respectively, totalling over 10000 samples. The POETIC trial comprises of only post-menopausal women diagnosed with primary breast cancer and considered to have an estrogen receptor positive status (IHC) in the initial diagnosis.

TCGA was downloaded from Firebrowse. The version 3 of SCANB data (StringTie FPKM Gene Data unadjusted) was downloaded from Mendeley https://data.mendeley.com/datasets/yzxtxn4nmd (27). METABRIC was downloaded directly from cBioPortal. POETIC data was downloaded from GEO (accession code GSE105777) using the R package GEOquery. For details about downloading and preprocessing steps see https://chronchi.github.io/transcriptomics.

Selection of highly variable genes

In order to perform the PCA, we sub selected 1000 common and highly variable genes in the TCGA and METABRIC cohorts. For each gene and in each cohort separately, the coefficient of variation (CV) was calculated:

\[ \text{CV} = \frac{\text{standard deviation}}{\text{average expression level}}. \]

The 1000 genes with highest average CV were selected.

qPCR-like normalization

Given a list of 44 stable genes across different cancers (28) and 1000 genes selected, all 1044 genes were ranked from lowest to highest expression for each sample separately and the rankings were divided by the average ranking of the 44 stable genes.

PCA embedding

Using the normalized data as described in the qPCR-like normalization section, a total of 1000 random samples (fixed seed in R) from TCGA and METABRIC were selected to perform the initial PCA, using PCAtools (29) in R. PCA was performed without centering and scaling since the data is already centered and scaled for all genes and samples. The embedding for new individual samples is obtained by multiplying the loadings matrix with the sample normalized expression. For samples where not all the 1044 genes are available, normalization was performed and the missing genes were padded with 0.

Scoring strategies

For the 4 big cohorts, TCGA, SCANB, METABRIC and POETIC, GSVA (30) was applied along with the \(SET_{ER/PR}\) signature (31) and the hallmark collection from the molecular signature database (24,32). Default parameters were used in the gsva function from the GSVA package.

Average neighborhood scores

In order to calculate the posterior distribution of the average scores in each neighborhood, a linear regression with only intercept (score ~ 1) was fitted using rstanarm (33) and the function stan_glm for each pathway individually. When applying the function stan_glm, we used four chains and a prior normal distribution with location \(0\) and scale equals to \(1\). The package tidybayes (34) was used to extract the draws and put them in a tidy format.

Survival analysis

Survival analysis was done using cox regression with the survival package from R. Variables used for adjustment were age, tumor stage and number of positive lymph nodes for TCGA and SCANB, age and the Nottingham Prognostic Index (NPI) for METABRIC. Overall survival was performed for all three cohorts and recurrence free survival for METABRIC and SCANB. It could be that there are unmeasured confounders that are affecting the pathways and outcomes at the same time, leading to a biased result.

Risk score

We used cox regression on data of ER+ breast cancer positive patients that received only endocrine therapy. A subset of samples from METABRIC was used as a training cohort. The rest of the samples were used as a test to evaluate the prognostic value of the signature, SCANB was used as a validation cohort. The cox regression was calculated with the following covariates: PC3, PC4, tumor size, node status (negative if no node, positive otherwise) and age. Risk score was calculated as the sum of the coefficients multiplied by the respective values for each patient.

\[ \begin{aligned} -0.09 \times & \text{PC3} + 0.08 \times \text{PC4} + 0.02 \times \text{Tumor size} \\+ & 0.58 \times \text{Node status} + 0.003 \times \text{Age} \end{aligned} \]

In order to compare the risk score and its clinical utility we performed a likelihood ratio test with the full model (risk score, binary category coming from ROR, tumor size, age and node status) and the base model (without the risk score).

A risk score using only the principal components was calculated and the formula is provided below.

\[ -0.09 \times \text{PC3} + 0.09 \times \text{PC4} \]

Code availability

The code used to generate all the analysis is available on https://github.com/chronchi/molecular_landscape. To fully reproduce the images in this paper check the instructions on how to run the docker image and RStudio session at the github repository. An online version with a website containing all the analysis and code can be found on https://chronchi.github.io/molecular_landscape. To check the code in the previous link click in source code on each chapter separately.

Results

Estrogen receptor signaling is a clinical continuous variable

Currently ER status is used for the decision to administer endocrine therapy and it is considered binary with cutoff for ER+ defined as >1 to >10% of tumor cells expressing the ER by IHC. We hypothesized that a more functional readout as estrogen receptor (ER) signaling activity determined by scores from bulk RNA-seq samples is a more adequate measure among patients that received only endocrine therapy. To test this hypothesis we used the estrogen signatures HALLMARK_ESTROGEN_RESPONSE_EARLY and HALLMARK_ESTROGEN_RESPONSE_LATE that were extracted from the molecular signature database (24) (MSigDB) and \(SET_{ER/PR}\) (31). The signatures from MSigDB each contain 200 genes and the latter signature has 18 genes that are associated with estrogen and progesterone receptor signaling and applied them to three independent breast cancer RNA based datasets, TCGA, SCANB and METABRIC (16,20,21) to calculate estrogen signaling scores. Each of these cohorts comprises more than 1000 untreated patient tumors. The individual scores for each patient sample were ranging from -1 to 1 in each of the three cohorts ( Figure 1 (a)) stratified by estrogen receptor status. There is a difference in the scores between the estrogen receptor (ER) positive and negative subgroups for each cohort separately. The ER- has mostly negative ER signaling scores. On the other hand, the ER+ group has scores that range from negative to positive, covering the whole spectrum of possible scores (Figure 1 (a)).

To assess how the functional ER score related to ER IHC index, we used the SCANB dataset to calculate the correlation between the scores. There is a small correlation between the molecular ER signaling score and ER percentage (Figure 1 (b)). Among those patients with high ER IHC index (over 90%), the distribution of ER signaling scores is all over the spectrum, indicating a mismatch between IHC and the functional form of the ER signaling score (Figure 1 (c)).

Next we wanted to evaluate the predictive power of the molecular ER signaling scores, given that there is a mismatch between ER IHC percentage and molecular ER signaling score based on the SCANB dataset. Cox regression was used to determine the hazard ratio (HR) of the ER signaling in overall survival (OS) for TCGA, SCANB and METABRIC and recurrence free survival (RFS) for METABRIC and SCANB. Each survival analysis was done independently and adjusted for available clinical variables. Tumor size and number of lymph nodes were used for TCGA and SCANB cohorts. The Nottingham prognostic index (NPI) was used for METABRIC. Age was used in all cohorts as a clinical variable for adjustment. To better understand the effect of ER signaling in the clinics, samples from ER+ BC patients that received only endocrine therapy (for SCANB and METABRIC), or samples with the luminal A and B PAM50 molecular subtype (for TCGA). Such sampling may avoid a bias on the coefficients due to chemotherapy, as patients who receive chemotherapy tend to have a worse outcome compared to those that do not. In all the three cohorts, the hazard ratio for estrogen early (Figure 1 (d)) was below 1 (TCGA: 0.23, 0.05-0.99 CI; SCANB: 0.22, 0.10-0.45 CI; METABRIC: 0.56, 0.34-0.89 CI). This shows the continuous aspect of the functional ER signaling. For both hallmark estrogen response late and SET ER/PR, results were similar to estrogen response early (Table S1). In this case considering only the ER+ BC patients that received only ET we could argue that these signatures are both prognostic and preditictive of ET response.

Given that the ER IHC index could have an impact in the previous results, we only selected patients with high ER IHC index (> 90%) and re-performed survival analysis on the SCANB cohort (Figure 1 (e)). The HR for ER percentage is 1.01 (CI 0.88-1.03). But even among those patients the two estrogen signatures have a HR below 1 (OS: 0.26, 0.10-0.67; RFS: 0.48, 0.03-1.02). This illustrates that even among patients with high ER IHC index, there are those with low ER signaling scores indicating a poor sensitivity to ET.

Single sample integration preserves relevant breast cancer properties

Given that not all patients with high ER IHC percentage will respond the same to endocrine therapy, we aimed to better identify the non-responders based on molecular pathways and transcriptional signatures for a single patient. To this aim, we developed a single sample batch effect removal method to integrate microarray and bulk RNA-seq and create a molecular landscape, which is an embedding of the molecular data into a common space for all samples. The advantage of the method is when given a new sample, it can easily be integrated with all the other previous samples without retraining, therefore being possible to integrate in the clinical practice. The first step of the method is to normalize the samples to the same scale. The average ranking of housekeeping genes distribution is similar among the two RNA-seq datasets, SCAN-B and TCGA and their mean values are higher than the mean of the distribution on METABRIC (Figure S1 (a)). The normalization preserves the distinctions of gene expression levels between ER+ and ER- breast cancer samples, ESR1 and TFF1 being such examples (Figure S1 (b)).

The biplot in Figure 2 (a) with the third and fourth components from TCGA and METABRIC samples shows that the samples are overlapping across the two cohorts. All samples, including those used for training and validation, are plotted.

In order to check the robustness of the procedure, the analysis was repeated 10 times with 10 random subsets of patient samples from TCGA and METABRIC, simulating a cross validation process. Figure 2 (b) shows the embedding of some of the validation datasets. The blue dots correspond to the original embedding of the samples, the red dots to the new embedding and the lines are connecting the same sample. Whenever a new embedding is performed, all samples are either shifted in the same direction or they are reflected along a common axis, showing the invariance properties of the embedding. (Figure 2 (b)).

Missing genes are a problem inherent to publicly available datasets. To test the impact of the missing genes in the embedding, genes were ranked based on their loadings of the third and fourth component and then we removed 200 genes in total with a varying proportion of the top loading genes (ranging from 0 to 100% with a 5% step). The higher the proportion the less precise the embedding is, with the embedding converging towards the origin, i.e., the (0,0) coordinates (Figure 2 (c)). If the proportion of top genes missing are below 30% (60 genes) the embedding is not heavily impacted (Figure 2 (c)).

The third component corresponds to the separation between ER+ and ER- BC patients in both cohorts (Figure 2 (d)). A combination of the third and fourth components shows a good distinction among the PAM50 molecular subtypes (Figure 2 (e)), showing that the embedding captures important molecular information. The fourth component mostly divides the luminal A and luminal B subtypes, whereas the normal-like subtype is spread across the third and fourth component. Since the normal samples are both close to the luminal A and basal subtypes in terms of the euclidean distance, one should only compare samples that are close to each other for pathway analysis. To further characterize the third component, we colored the molecular landscape by the \(SET_{ER/PR}\) signature, showing a gradient with high values on the right and low values on the left (Figure 2 (f)). Other clinical factors, such as tumor stage, node stage, age, NPI and tumor purity show no influence in the embedding (Figure S2).

Embedding generalizes to a validation cohort

So far only samples from TCGA and METABRIC were used in the training and projection. SCANB was used as an external validation cohort. Similar to the previous results, by inspecting the third and fourth components SCANB overlapped with METABRIC and TCGA (Figure 3 (a)). ER+ and ER- BC patients are well separated (Figure 3 (b)) and the procedure can also distinguish the molecular subtypes (Figure 3 (c)) on top of the other samples coming from TCGA and METABRIC. We show that the embedding works for the SMC cohort (35) (Figure S3). As an RNA-seq cohort, it is expected that SCANB samples will be closer to TCGA than to METABRIC when removing batch effects, mostly due to platform. Biplot of PC1 and PC2 (Figure 3 (d)) shows that SCANB is closer to TCGA than to METABRIC.

Together, these findings suggested that our approach is robust. Next we also the molecular landscape with patient derived xenografts (PDX) (23). These PDXs are breast cancer cells derived from patient tumor samples obtained from clinics and injected into the mammary gland of mice (36). By using the MIND model, we can engraft cells coming from ER+ BC tumors. Moreover, to have a succesfull engraftment rate, the cells need to grow so they can establish across the ducts. Therefore, the cells when extracted for RNA-sequencing are usually in a proliferative state. Figure 3 (e) shows the embedding of six different PDX samples with multiple biological replicates (23). These samples are in the scattered around the luminal B region, showing that the landscape captures the biology of the experiment and it shows the intertumor heterogeneity among ER+ BC patients in a research environment. Moreover, we used 66 samples coming from women that performed reduction mammoplasty in Switzerland. Figure 3 (f) shows that these samples are in the region of the normal-like PAM50 molecular subtype, as expected.

Molecular landscape is a tool to understand and explore patient heterogeneity

Since the molecular landscape relies in a single sample to obtain the embedding, we can use samples from any other cohort and compare to clinical factors. To provide a use of the molecular landscape in the clinical practice, we used the POETIC trial microarray data. The POETIC trial (37) was a trial that evaluated the use of perioperative aromatase inhibitors in ER+, postmenopausal BC patients. Its primary endpoint was time to recurrence. Matched samples from treated and untreated patients at baseline (before treatment) and at surgery (after an average 14 days of treatment) (22) were sent for microarray hybridization. The patients have matched Ki67 percentage levels, which can be considered an indication of how well a patient responded to the endocrine therapy. Patients with more than 5% of baseline Ki67 and a reduction of 60% upon endocrine therapy are considered responders, otherwise they are defined as non responders.

In order to gain insights on the differences between responders and non responders, we embedded the POETIC trial samples using the procedure and studied molecular landscape (Figure 4 (a) left). The samples are spread across the whole molecular landscape, consistent with patients having different molecular biological properties. Furthermore, the POETIC samples are embedded closer to the METABRIC samples (Fig S4). Given the available information, the BC patients that are ER+ and in the left part of the landscape (ER- patients), are all non responders (Figure 4 (a) right). Moreover, some of the non responder patients are actually in the basal cluster (Figure 4 (c)). We selected two samples, from patients that were responder and non responder and are close in the molecular landscape (Figure 4 (c)) to highlight their molecular differences and see what is in their context. Figure 4 (d) shows the average posterior distribution of the neighborhood for the responder patient. The responder patient has a ER signaling score higher than the average. On the other hand, the non responder has a smaller ER signaling score than the average (Figure 4 (e)) and also a higher androgen signaling score (Androgen response). Other pathways, such as EMT, E2F targets, P53 and TGF\(\beta\) signaling along with their average posterior distributions are shown for both patients.

Generating hypothesis: subgroups and alternative treatments

A key aspect of precision medicine is finding the right patients for targeting with the proper drug. With the molecular landscape we can try to identify different subgroups, based on molecular pathway scores. We identify several pathways driving the distinction between samples in the embedding, such as G2M checkpoint, epithelial to mesenchymal transition (EMT), DNA repair and PI3K AKT MTOR signaling (Figure 5 (a)). The scores go from one direction to other across the two different principal components, indicating that they complement each other when capturing information. Not only G2M checkpoint is differentiating luminal A and B patients, EMT also is.

The overall and recurrence free survival calculated from patients that have ER+ BC, received only endocrine therapy (METABRIC: OS/RFS and SCANB: OS/RFS) and have \(PC3 > 0\) show that G2M checkpoint has a hazard ratio higher than 1 (Figure 5 (b); Table S2) and a tight confidence interval, which reflects the fact the higher the proliferation the worse the outcome, and reflects the subtyping differences between luminal A and B. The PI3K AKT MTOR signaling has a hazard ratio higher than 1 with a wider confidence interval, indicating a possible subgroup among all selected patients that may benefit from additional adjuvant therapy targeting this specific pathway.

Development of a new risk score and validation

We hypothesized that the molecular landscape contains clinically relevant information and that it could either complement or outperform the use of commercially available gene signatures. We used the third and fourth components separately in a cox regression, adjusted by the clinical variables mentioned in the methods section. In both overall survival and recurrence free survival analysis, the PC3 had a hazard ratio below 1 for the METABRIC, SCANB and SCANB high ER IHC percentage (more than 90%) BC patients that received only endocrine therapy. On the other hand, PC4 had hazard ratios higher than 1 in all cohorts for both OS and RFS (Figure 6 (a)). To further validate the clinical aspect of the molecular landscape, we calculated the euclidean distance between the surgery and baseline matched samples of the POETIC trial data. Consistent with our hypothesis that the position of in the molecular landscape is clinically relevant, the average distance of the embedding position is higher in responders than non-responders (Figure 6 (b)).

To further evaluate the clinical importance of the positions in the molecular landscape, we developed a new risk score based on clinical variables (tumor size, age and node status) and the position in the molecular landscape (PC3 and PC4 scores). Table 1 shows the results of the recurrence free survival analysis performed using these variables as coefficients and ER+ BC patients treated with ET only. The risk score developed follows a gradient from the bottom right of the molecular landscape to the upper left of the embedding (Figure 6 (c)) among all the patients treated with endocrine therapy only and that have ER+ BC.

To provide aditional benefit based on the scores in the To further validate the clinical utility of the score, we compared the new developed risk score with the risk of recurrence score available from the SCANB patients calculated by the nearest centroid technique (16). We see that there is a positive correlation among the scores, moreover there is a population of luminal A patients that present a high risk score that is not captured by the ROR, as by definition luminal A patients are of low/intermediate risk. We further showed that by using the binary categorization of the ROR into high and low/intermediate risk groups (16) there is additional benefit in using the new risk score in a survival analysis (p-value = 0.02). Figure 6 (e) shows that there is a distinction in the distributions of the low/intermediate and high risk patients, but even among the low/intermediate there are those with risk scores similar to those of high risk, which are by majority luminal B. As another way of validating the risk score, we calculated the risk scores using only the principal components on the POETIC trial data. Responders have a decrease in average in the risk score. Non-responders have almost the same risk score (Figure 6 (f)).

Discussion

Personalized medicine is a key topic in medicine and BC (23). The goal of better understanding the molecular underpinnings of the diseases leads to a better allocation of treatments and resources in the patient care. This has been shown to be necessary by using PDX mouse models, where different PDXs respond differently to hormone treatments (23). We have shown here a possible framework to deal with personalized medicine in breast cancer in general with a focus on ER+ BC patients.

Here we show using cox regression and multiple cohorts (12,20,21) with both RNA-seq and microarray data, that the function ER signaling score is a continuous predictive marker. Among the high ER IHC index patients (>90%), the ER score is still a predictive maker, highlighting the added benefit of using transcriptomic assays to understand the biology of the tumors. This can aid when evaluating treatment options and avoiding overtreatment. The functional ER scores presented here might be associated with the commercial signatures, as it has been shown that OncotypeDX’s estrogen module is highly correlated with the general OncotypeDX signature (11).

Integrating molecular data stemming from different sources is a challenge. On one hand batch effect tools are usually able to remove the batch effects across the different sources of variability (17,18), on the other hand they are not single sample based, meaning each time a new sample comes the algorithm needs to be run again. It is also based on the fact one has enough data in the different datasets, otherwise it skews the possible integration towards one of the datasets. Here we show by using TCGA, METABRIC and SCANB that it is possible to integrate the samples from these cohorts in a single sample manner. The samples show good mixing when using test samples not seen during the training stage. The embeddings preserve key molecular features of breast cancer. PC3 is clearly driven by estrogen receptor signaling where from right to left there is a gradient in the ER scores, such as Estrogen early and \(SET_{ER/PR}\). On the other hand, PC4 is what makes a difference between the molecular subtypes luminal A and B, which in practice differ by proliferation status in terms of Ki67 levels (38).

Moreover, the embeddings in the validation set (SCANB) preserve the key features of breast cancer. Samples are projected by their different PAM50 molecular subtypes and there is a gradient of estrogen signaling pathway from ER+ BC towards ER- BC patients in all the three cohorts combined. The first two components reflect batch effects. Expectedly, SCANB projected closer to TCGA, since both datasets are RNA-seq. On the other hand, part of the POETIC trial was sequenced using microarray dataset (22,37), and we show that it is projected closer to METABRIC as expected according to the first two components. This highlights that the method can capture information from different technologies and remove such batch effects. Moreover all the samples were sent for sequencing in different contexts at different times and populations, showing the power of the embedding method in removing batch effects and capturing truly the biology of breast cancer.

Sometimes when dealing with publicly available datasets, not all of the genes are available due to ethical protocols (12), pre-processing or technical reasons. We showed how robust the projection is to missing genes with high loadings in the projection. If less than 30% of the top genes (ranked by the loadings of the third and fourth principal components) are missing, we can recover the position of the embedding if all the. The more genes that are missing, the closer the projection will be to the origin, i.e., the (0,0) coordinate in the embedding of the third and fourth components of the PCA.

To show the clinical validity of the projection, we used a subset of patients from the POETIC trial with microarray and clinical information (22). The samples projected as expected and surprisingly the samples that are considered to be non responders upon 2 weeks of aromatase inhibition are projected among the ER- BC patients. When looking further upon two different patients that have similar embedding but different response to endocrine therapy, we see that the responder has a higher value of estrogen signaling than the average. On the other hand, the non responder patient has a smaller estrogen signaling score than the average, which suggests a possible explanation for the difference in response. Moreover, estrogen and androgen receptor signaling have been shown to be tightly linked (39) and these two patients have different androgen signaling scores. The non responder has a higher score than the average compared to the responder, whose score is just the same as the average.

Complementing the personalized approach, the molecular landscape can be used as a starting point to understand breast cancer molecular subgroups in a more intuitive way using pathways. Using survival analysis we show that some of the pathways that are important for the embedding can be used for subgrouping and hypothesis generation. Current clinical trials (40) are evaluating the use of PI3K-AKT-MTOR inhibitors in advanced metastatic BC patients, here we show weak evidence on the molecular level that such treatment could benefit early stage primary BC patients.

A weakness of the proposed pipeline is that we rely on GSVA scores, which can be used and compared across different cohorts since they have a representation of all molecular subtypes. A way to circumvent this is by having a small library of RNAseq or microarray samples that are representative of the patient population. This way when scoring a new patient, the scores can be compared across different cohorts. There is still a cost barrier for using RNAseq dataset in the clinical setting, but efforts are being made across the industry to reduce the sequencing costs and make it more widespread. An example is Alithea genomics, a company aiming to provide large-scale RNA-sequencing by using Bulk RNA barcoding and sequencing (BRB-seq).

With the development of new therapies, the use of better tools to empower clinicians is necessary. We believe that RNA-seq is a very powerful technique that can be used in clinical practice. The SCANB team (12,16) has shown how feasible it is to have a pipeline set up for RNA-sequencing. We have shown here how all the data that was generated can be combined and used both for clinical practice and also for research purposes, highlighing the power of RNA-seq.

References

Figure 1: Scores and survival analysis results from TCGA, SCANB and METABRIC cohorts. (a) GSVA scores for the SET ER/PR signature for each cohort. Each point corresponds to a patient sample and they are divided by estrogen receptor status. (b) Correlation of ER percentage with molecular signatures on the SCANB cohort. (c) Distribution of the molecular scores among patients with high ER percentage (more than 90%) from SCANB. (d) Forest plot of the survival analysis for each cohort separately. (e) Forest plot of the survival analysis for high ER percentage patients that were treated only with endocrine therapy from SCANB. NPI: Nottingham prognostic index. Ti: i-th stage of tumor. Ni: i lymph nodes with breast cancer cells.

Figure 2: (a) Biplot using the third and fourth components on TCGA and METABRIC samples. Colored by cohort. (b) Embedding of random samples given different training sets for PCA. Blue dots correspond to the original embedding of a sample and red dots correspond to the new embedding given the new training set. (c) Biplot of all possible embeddings of sample given a certain proportion of top loadings missing in the dataset. (d) Same as a, colored by ER status. (e) Same as a, colored by PAM50 molecular subtype. (f) Hex grid calculated on the biplot of the fourth and third component. Each hex is colored based on its average value of the estrogen response early.

Figure 3: Validation of the molecular landscape with an external cohort (a) Biplot using the third and fourth components and now including all samples from the three cohorts: TCGA, METABRIC and SCANB. (b) Same as a, colored by ER status. (c) Same as a, colored by PAM50 molecular subtype. (d) Biplot using the first and second component of TCGA, METABRIC and SCANB. (e) Biplot of the PDXs on top of all three big cohorts. (f) Biplot of the normal samples on top of the three big cohorts colored by PAM50 molecular subtype.

Figure 4: Embedding of the POETIC cohort into the molecular landscape and pathway analysis for patient samples. (a) Biplots of the POETIC samples (baseline and surgery) into the molecular landscape. Left plot is colored by cohort and right plot is colored by ER status. (b) Biplot of the POETIC sample colored by response status. (c) Biplot highlighting two patients with similar embedding and different response status. (d) Posterior distributions of the average scores in the neighborhood of the responder patient. Red line corresponds to the patient score. (e) Posterior distributions of the average scores in the neighborhood of the non-responder patient. Red line corresponds to the patient score.

Figure 5: Average scores in hex regions using all three cohorts together and survival analysis in each cohort individually. (a) Biplots of all samples from TCGA, SCANB and METABRIC that were grouped in small hex regions. Colors are depicted as the average score value in each hex region. Selected pathways are shown. (b) Survival analysis results obtained for each pathway individually. The pathways were adjusted as described in the methods section. CI: confidence interval.

Figure 6: Motivation for the development of risk score and validation of the results. (a) Forest plot of PC3 and PC4 for SCANB and METABRIC overall survival and recurrence free survival. (b) Average posterior distribution of the differences in position of the embedding for the two groups, responders and non-responders, in the POETIC trial dataset. (c) Hex plot of the risk scores from the SCANB ER+ BC patients treated with only endocrine therapy. (d) Risk score developed by the risk of recurrence (ROR) calculated using the nearest centroid algorithm and colored by the PAM50 molecular subtype. (e) Risk score comparison between the high risk and low/intermediate risk based on ROR. (f) Comparison of the risk scores from POETIC trial patients based on the principal components only and colored by response status. Risk scores were calculated for baseline and surgery samples. Each dot corresponds to one patient sample.

Table 1: Coefficients of the recurrence free survival analysis performed using cox regression on a subset of METABRIC samples coming from ER+ BC patients treated with only ET. CI: confidence interval.
Coefficient Hazard Ratio Std error p-value CI low CI high
PC3 0.91 0.0273 7.8e-04 0.87 0.96
PC4 1.08 0.0226 3.6e-04 1.04 1.13
tumor_size 1.02 0.0043 3.7e-07 1.01 1.03
node_statuspos 1.79 0.1476 7.9e-05 1.34 2.39
age 1.00 0.0070 6.5e-01 0.99 1.02

References

1.
World Health Organization: Regional Office for Europe. World cancer report. IARC; 2020.
2.
Anderson WF, Chatterjee N, Ershler WB, Brawley OW. Estrogen receptor breast cancer phenotypes in the surveillance, epidemiology, and end results database. Breast Cancer Research and Treatment [Internet]. Springer Science; Business Media LLC; 2002;76:27–36. Available from: https://doi.org/10.1023/a:1020299707510
3.
Allison KH, Hammond MEH, Dowsett M, McKernin SE, Carey LA, Fitzgibbons PL, et al. Estrogen and progesterone receptor testing in breast cancer: ASCO/CAP guideline update. Journal of Clinical Oncology [Internet]. American Society of Clinical Oncology (ASCO); 2020;38:1346–66. Available from: https://doi.org/10.1200/jco.19.02309
4.
5.
Dieci MV, Griguolo G, Bottosso M, Tsvetkova V, Giorgi CA, Vernaci G, et al. Impact of estrogen receptor levels on outcome in non-metastatic triple negative breast cancer patients treated with neoadjuvant/adjuvant chemotherapy. npj Breast Cancer [Internet]. Springer Science; Business Media LLC; 2021;7. Available from: https://doi.org/10.1038/s41523-021-00308-7
6.
Cardoso F, Veer LJ van’t, Bogaerts J, Slaets L, Viale G, Delaloge S, et al. 70-gene signature as an aid to treatment decisions in early-stage breast cancer. New England Journal of Medicine [Internet]. Massachusetts Medical Society; 2016;375:717–29. Available from: https://doi.org/10.1056/nejmoa1602253
7.
Parker JS, Mullins M, Cheang MCU, Leung S, Voduc D, Vickery T, et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. Journal of Clinical Oncology [Internet]. American Society of Clinical Oncology (ASCO); 2009;27:1160–7. Available from: https://doi.org/10.1200/jco.2008.18.1370
8.
Sparano JA, Gray RJ, Makower DF, Pritchard KI, Albain KS, Hayes DF, et al. Adjuvant chemotherapy guided by a 21-gene expression assay in breast cancer. New England Journal of Medicine [Internet]. Massachusetts Medical Society; 2018;379:111–21. Available from: https://doi.org/10.1056/nejmoa1804710
9.
Filipits M, Rudas M, Jakesz R, Dubsky P, Fitzal F, Singer CF, et al. A new molecular predictor of distant recurrence in ER-positive, HER2-negative breast cancer adds independent information to conventional clinical risk factors. Clinical Cancer Research [Internet]. American Association for Cancer Research (AACR); 2011;17:6012–20. Available from: https://doi.org/10.1158/1078-0432.ccr-11-0926
10.
Jankowitz RC, Cooper K, Erlander MG, Ma X-J, Kesty NC, Li H, et al. Prognostic utility of the breast cancer index and comparison to adjuvant! Online in a clinical case series of early breast cancer. Breast Cancer Research [Internet]. Springer Science; Business Media LLC; 2011;13. Available from: https://doi.org/10.1186/bcr3038
11.
Buus R, Sestak I, Kronenwett R, Ferree S, Schnabel CA, Baehner FL, et al. Molecular drivers of onco\(\less\)i\(\greater\)type\(\less\)/i\(\greater\) DX, prosigna, EndoPredict, and the breast cancer index: A TransATAC study. Journal of Clinical Oncology [Internet]. American Society of Clinical Oncology (ASCO); 2021;39:126–35. Available from: https://doi.org/10.1200/jco.20.00853
12.
Saal LH, Vallon-Christersson J, Häkkinen J, Hegardt C, Grabau D, Winter C, et al. The sweden cancerome analysis network - breast (SCAN-b) initiative: A large-scale multicenter infrastructure towards implementation of breast cancer genomic analyses in the clinical routine. Genome Medicine [Internet]. Springer Science; Business Media LLC; 2015;7:20. Available from: https://doi.org/10.1186/s13073-015-0131-9
13.
Dihge L, Vallon-Christersson J, Hegardt C, Saal LH, Häkkinen J, Larsson C, et al. Prediction of lymph node metastasis in breast cancer by gene expression and clinicopathological models: Development and validation within a population-based cohort. Clinical Cancer Research [Internet]. American Association for Cancer Research (AACR); 2019;25:6368–81. Available from: https://doi.org/10.1158/1078-0432.ccr-19-0075
14.
Brueffer C, Gladchuk S, Winter C, Vallon-Christersson J, Hegardt C, Häkkinen J, et al. The mutational landscape of the \(\less\)scp\(\greater\)SCAN\(\less\)/scp\(\greater\) -b real-world primary breast cancer transcriptome. EMBO Molecular Medicine [Internet]. EMBO; 2020;12. Available from: https://doi.org/10.15252/emmm.202012118
15.
Dahlgren M, George AM, Brueffer C, Gladchuk S, Chen Y, Vallon-Christersson J, et al. Preexisting somatic mutations of estrogen receptor alpha (\(\less\)i\(\greater\)ESR1\(\less\)/i\(\greater\)) in early-stage primary breast cancer. JNCI Cancer Spectrum [Internet]. Oxford University Press (OUP); 2021;5. Available from: https://doi.org/10.1093/jncics/pkab028
16.
Staaf J, Häkkinen J, Hegardt C, Saal LH, Kimbung S, Hedenfalk I, et al. RNA sequencing-based single sample predictors of molecular subtype and risk of recurrence for clinical assessment of early-stage breast cancer. npj Breast Cancer [Internet]. Springer Science; Business Media LLC; 2022;8. Available from: https://doi.org/10.1038/s41523-022-00465-3
17.
Risso D, Ngai J, Speed TP, Dudoit S. Normalization of RNA-seq data using factor analysis of control genes or samples. Nature Biotechnology [Internet]. Springer Science; Business Media LLC; 2014;32:896–902. Available from: https://doi.org/10.1038/nbt.2931
18.
Zhang Y, Parmigiani G, Johnson WE. ComBat-seq: Batch effect adjustment for RNA-seq count data. NAR Genomics and Bioinformatics [Internet]. Oxford University Press (OUP); 2020;2. Available from: https://doi.org/10.1093/nargab/lqaa078
19.
Fei T, Zhang T, Shi W, Yu T. Mitigating the adverse impact of batch effects in sample pattern detection. Birol I, editor. Bioinformatics [Internet]. Oxford University Press (OUP); 2018;34:2634–41. Available from: https://doi.org/10.1093/bioinformatics/bty117
20.
Comprehensive molecular portraits of human breast tumours. Nature [Internet]. Springer Science; Business Media LLC; 2012;490:61–70. Available from: https://doi.org/10.1038/nature11412
21.
Curtis C, Sohrab P. Shah and, Chin S-F, Turashvili G, Rueda OM, Dunning MJ, et al. The genomic and transcriptomic architecture of 2, 000 breast tumours reveals novel subgroups. Nature [Internet]. Springer Science; Business Media LLC; 2012;486:346–52. Available from: https://doi.org/10.1038/nature10983
22.
Gao Q, Elena López-Knowles and, Cheang MCU, Morden J, Ribas R, Sidhu K, et al. Impact of aromatase inhibitor treatment on global gene expression and its association with antiproliferative response in ER\(\mathplus\) breast cancer in postmenopausal patients. Breast Cancer Research [Internet]. Springer Science; Business Media LLC; 2019;22. Available from: https://doi.org/10.1186/s13058-019-1223-z
23.
Scabia V, Ayyanan A, Martino FD, Agnoletto A, Battista L, Laszlo C, et al. Estrogen receptor positive breast cancers have patient specific hormone sensitivities and rely on progesterone receptor. Nature Communications [Internet]. Springer Science; Business Media LLC; 2022;13. Available from: https://doi.org/10.1038/s41467-022-30898-0
24.
Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences [Internet]. Proceedings of the National Academy of Sciences; 2005;102:15545–50. Available from: https://doi.org/10.1073/pnas.0506580102
25.
Liberzon A, Subramanian A, Pinchback R, Thorvaldsdottir H, Tamayo P, Mesirov JP. Molecular signatures database (MSigDB) 3.0. Bioinformatics [Internet]. Oxford University Press (OUP); 2011;27:1739–40. Available from: https://doi.org/10.1093/bioinformatics/btr260
26.
Novartis Kisqali® Phase III NATALEE trial meets primary endpoint at interim analysis demonstrating clinically meaningful benefit in broad population of patients with early breast cancer. https://www.novartis.com/news/media-releases/novartis-kisqali-phase-iii-natalee-trial-meets-primary-endpoint-interim-analysis-demonstrating-clinically-meaningful-benefit-broad-population-patients-early-breast-cancer; 2023.
27.
Johan Vallon-Christersson. RNA sequencing-based single sample predictors of molecular subtype and risk of recurrence for clinical assessment of early-stage breast cancer [Internet]. Mendeley; 2023. Available from: https://data.mendeley.com/datasets/yzxtxn4nmd/3
28.
Bhuva DD, Cursons J, Davis MJ. Stable gene expression for normalisation and single-sample scoring. Nucleic Acids Research [Internet]. Oxford University Press (OUP); 2020;48:e113–3. Available from: https://doi.org/10.1093/nar/gkaa802
29.
Blighe K, Lun A. PCAtools: PCAtools: Everything principal components analysis [Internet]. 2022. Available from: https://github.com/kevinblighe/PCAtools
30.
Hänzelmann S, Castelo R, Guinney J. GSVA: Gene set variation analysis for microarray and RNA-seq data. BMC Bioinformatics [Internet]. Springer Science; Business Media LLC; 2013;14. Available from: https://doi.org/10.1186/1471-2105-14-7
31.
Sinn BV, Fu C, Lau R, Litton J, Tsai T-H, Murthy R, et al. SETER/PR: A robust 18-gene predictor for sensitivity to endocrine therapy for metastatic breast cancer. npj Breast Cancer [Internet]. Springer Science; Business Media LLC; 2019;5. Available from: https://doi.org/10.1038/s41523-019-0111-0
32.
Mootha VK, Lindgren CM, Eriksson K-F, Subramanian A, Sihag S, Lehar J, et al. PGC-1\(\upalpha\)-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics [Internet]. Springer Science; Business Media LLC; 2003;34:267–73. Available from: https://doi.org/10.1038/ng1180
33.
Goodrich B, Gabry J, Ali I, Brilleman S. Rstanarm: Bayesian applied regression modeling via Stan. [Internet]. 2022. Available from: https://mc-stan.org/rstanarm/
34.
Kay M. tidybayes: Tidy data and geoms for Bayesian models [Internet]. 2022. Available from: http://mjskay.github.io/tidybayes/
35.
Kan Z, Ding Y, Kim J, Jung HH, Chung W, Lal S, et al. Multi-omics profiling of younger asian breast cancers reveals distinctive molecular signatures. Nature Communications [Internet]. Springer Science; Business Media LLC; 2018;9. Available from: https://doi.org/10.1038/s41467-018-04129-4
36.
Sflomos G, Dormoy V, Metsalu T, Jeitziner R, Battista L, Scabia V, et al. A preclinical model for ER\(\upalpha\)-positive breast cancer points to the epithelial microenvironment as determinant of luminal phenotype and hormone response. Cancer Cell [Internet]. Elsevier BV; 2016;29:407–22. Available from: https://doi.org/10.1016/j.ccell.2016.02.002
37.
Dowsett M, Smith I, Robertson J, Robison L, Pinhel I, Johnson L, et al. Endocrine therapy, new biologicals, and new study designs for presurgical studies in breast cancer. JNCI Monographs [Internet]. Oxford University Press (OUP); 2011;2011:120–3. Available from: https://doi.org/10.1093/jncimonographs/lgr034
38.
Arima N, Nishimura R, Osako T, Okumura Y, Nakano M, Fujisue M, et al. Ki‑67 index value and progesterone receptor status can predict prognosis and suitable treatment in node‑negative breast cancer patients with estrogen receptor‑positive and HER2‑negative tumors. Oncology Letters [Internet]. Spandidos Publications; 2018; Available from: https://doi.org/10.3892/ol.2018.9633
39.
Hickey TE, Selth LA, Chia KM, Laven-Law G, Milioli HH, Roden D, et al. The androgen receptor is a tumor suppressor in estrogen receptorpositive breast cancer. Nature Medicine [Internet]. Springer Science; Business Media LLC; 2021;27:310–20. Available from: https://doi.org/10.1038/s41591-020-01168-7
40.
AstraZeneca S. A Phase III Double-blind Randomised Study Assessing the Efficacy and Safety of Capivasertib + Fulvestrant Versus Placebo + Fulvestrant as Treatment for Locally Advanced (Inoperable) or Metastatic Hormone Receptor Positive, Human Epidermal Growth Factor Receptor 2 Negative (HR+/HER2-) Breast Cancer Following Recurrence or Progression On or After Treatment With an Aromatase Inhibitor. https://clinicaltrials.gov/ct2/show/NCT04305496; 2020.